Fast 6502 & 65802 Multiply Routines........Bob Sander-Cederlof

Since multiplication is not a built-in function in the 6502, 65C02, or 65802, many of us have written our own subroutines for the purpose.  I will present some efficient subroutines here, to handle the 8-bit and 16-bit cases.

I will assume both arguments are the same length (either 8 bits or 16 bits) and that we want the full product.  If the arguments are 8 bits long, the product will be 16 bits long.  If the arguments are 16 bits long, the product will be 32 bits long.  I will also assume the arguments are unsigned values.  Thus $FF times $FF will be $FE01 (in decimal, 255x255 = 65025).

Way back in February, 1981, I published an article with Brooke Boering's fast 16-bit multiplication subroutine.  His subroutine duplicated the functions of the subroutine in the original Apple Monitor ROM, but was nearly twice as fast.  Brooke's programs were originally published in the December, 1980, Micro magazine (now defunct).  He included an 8-bit multiply subroutine with an average time of only 192 cycles.

Damon Slye wrote an article for Call APPLE, published June, 1983.  He introduced some coding tricks which allow an 8-bit multiply in an average of 160 cycles.  I have reproduced Damon's program below, in lines 1010-1300.  His trick involves eliminating a CLC opcode from the loop in lines 1210-1260.  Ordinarily you would need a CLC before the ADC instruction; Damon decremented the multiplicand by one before starting the loop, so that adding with carry set works.  He does the decrementing in lines 1130-1160.  Note that if the original multiplicand was zero, he skips all the rest of the code and just returns the answer:  0.
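
The effect of the pre-decrement can be checked away from the Apple.  What follows is a minimal Python sketch that models the carry bit through the BCC-ADC-ROR-ROR sequence described above.  It is not Damon's listing; the function name and variable names are hypothetical, and the register roles are an assumption made only for illustration.

```python
def mul8_predecrement(multiplicand, multiplier):
    """Model of an 8-bit shift-and-add multiply whose ADC runs with carry set."""
    if multiplicand == 0:
        # Special case: 0 - 1 wraps to $FF, and $FF + carry would add 256.
        return 0
    m1 = (multiplicand - 1) & 0xFF   # pre-decremented multiplicand
    a = 0                            # high byte of the product
    lo = multiplier                  # multiplier, later the low byte of the product
    carry = lo & 1                   # first shift: multiplier bit 0 into carry
    lo >>= 1
    for _ in range(8):
        if carry:                    # BCC skips the add when the bit is 0
            total = a + m1 + carry   # ADC with carry set adds (m1 + 1)
            a = total & 0xFF
            carry = total >> 8
        # ROR A: carry enters bit 7, bit 0 leaves into carry
        a, carry = (carry << 7) | (a >> 1), a & 1
        # ROR low byte: product bit enters, next multiplier bit leaves
        lo, carry = (carry << 7) | (lo >> 1), lo & 1
    return (a << 8) | lo

print(hex(mul8_predecrement(0xFF, 0xFF)))   # 0xfe01
```

Removing the zero test in the sketch makes every product with a zero multiplicand come out wrong whenever the multiplier is non-zero, which is why the special case is needed.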

I had to go at least one step faster, so I partially "un-wrapped" the 8-step loop.  I changed it to loop only four times, but handled two bits of the multiplier each time.  This runs an average time of 140 cycles.  You could unwrap it all the way, writing out the BCC-ADC-ROR-ROR lines a total of 8 times, and cut the average time down to only 111 cycles.

Let me stop here and say what I mean by average time.  I am stating time in terms of "cycles", rather than seconds or microseconds.  The Apple has two different cycle times, depending on the video timing logic.  The average Apple speed is 1020488 cycles per second.  The multiply algorithms will vary in speed depending on the number of bits in the multiplier which equal "1".  If the multiplier = $FF (all ones) the algorithm will take the maximum time.  If the multiplier is $00, it will take the minimum time.  On the average for random arguments, the multiplier will have four zeroes and four ones, so the average time is equal to the average of the minimum and maximum times.  For all of the subroutines, I included the cycles for a JSR to call them, and for the RTS at the end.
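
The reasoning about averages can be checked with a few lines of Python.  This is only a sketch: it assumes, as stated above, that the time grows by a fixed amount for each one-bit in the multiplier, and the helper names are hypothetical.

```python
from fractions import Fraction

CYCLES_PER_SECOND = 1020488   # average Apple speed quoted above

def average_one_bits(width=8):
    """Average number of one-bits over every possible multiplier of this width."""
    total = sum(bin(m).count("1") for m in range(1 << width))
    return Fraction(total, 1 << width)

def average_cycles(minimum, maximum):
    """Average time, given a fixed cost per one-bit in the multiplier."""
    return (minimum + maximum) / 2

def microseconds(cycles):
    """Convert a cycle count to microseconds at the average Apple speed."""
    return cycles * 1_000_000 / CYCLES_PER_SECOND

print(average_one_bits())                  # 4
print(average_cycles(152, 168))            # 160.0
print(round(microseconds(160), 1))         # 156.8
```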

I programmed an 8-bit multiply using 65802 opcodes, as shown below in lines 1560-1790.  The program is slightly shorter (one byte), but that really isn't a fair comparison.  The arguments and product are handled differently, and so the effort to call the program may be more or less than that for the 6502 version.  Rather than passing the multiplicand in the X-register, I have it in the A-register.  I pass the multiplier in the high byte of the A-register.  Since X is not used for passing any values, I saved and restored it (lines 1620 and 1770).  I assumed the program would be called from the 6502 mode, which of course it was as long as I was testing it.  In "real life" it might be written to be called from Native 65802 mode, since the larger program it was a part of would also be taking advantage of all the 65802 features.

I used a couple of tricks to save space and time.  One you may justly complain about is that I store the multiplicand directly into the operand field of the ADC instruction at line 1720.  This definitely saves time, but it also could have serious drawbacks.  (For example, it would not work if executed from ROM.)  Since I enter in 6502 Emulation mode, line 1640 only loads 0 into the low byte of the A-register.  Lines 1650-1660 enter the 65802 Native mode.  Line 1680 sets the A-register to 16-bit mode.

In line 1690 I form the inverse (one's complement) of the multiplier.  This is just another way of eliminating the CLC from the loop.  Note that the multiplier is in the high byte of A, and the product is going to be accumulated in the low byte.  The loop runs from line 1700 through line 1740.  Line 1700 shifts to the left both the partial product and what remains of the multiplier, putting the highest remaining bit of the multiplier into the carry status bit.  If that bit = 1, then the original bit in the multiplier before complementing was a zero, so we do not add the multiplicand to the current partial product.  If that bit = 0, the original bit was a one; carry is already clear, so the ADC adds the multiplicand with no CLC needed.  As we continue through the loop, the bits of the multiplier keep shifting out just ahead of the ever-growing partial product, until finally we have the answer.
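
Here is a minimal Python model of that loop, for anyone who wants to convince themselves that the growing product never collides with the remaining multiplier bits.  It is a sketch of the idea only, not the 65802 listing; the function name is hypothetical, and the 16-bit A-register is modeled as a plain integer.

```python
def mul8_complement_model(multiplicand, multiplier):
    """Model of the 16-bit accumulator loop: multiplier in the high byte of A."""
    a = (multiplier << 8) & 0xFF00    # multiplier in high byte, low byte = 0
    a ^= 0xFF00                       # one's complement of the multiplier
    for _ in range(8):
        carry = a >> 15               # ASL A: top bit goes into carry
        a = (a << 1) & 0xFFFF
        if carry == 0:                # complemented bit 0 = original bit 1
            # ADC with carry already clear, so no CLC is needed
            a = (a + multiplicand + carry) & 0xFFFF
    return a                          # full 16-bit product

print(hex(mul8_complement_model(0xFF, 0xFF)))   # 0xfe01
```

After k passes the partial product is less than 2^(k+8), so it always fits below the multiplier bits that have not yet been shifted out.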

Lines 1750-1780 restore the machine state to the 6502 Emulation mode and restore the original X-register value.  The full product is now in the A-register.  If I wanted to print out the product, I might do it like this:

       XBA             GET HIGH BYTE INTO LOW-A
       JSR $FDDA       MONITOR PRINT-BYTE SUBROUTINE
       XBA             GET LOW BYTE INTO LOW-A
       JMP $FDDA

Here is a summary of the execution times (in cycles) for the three 8-bit multiply subroutines:

              Minimum  Maximum  Average
        Slye    152       168      160
        RBSC    132       148      140
        65802   119       135      127

The 65802 version would be seven cycles faster if we did not require saving and restoring the X-register.  If you want to change the 65802 version for calling from Native mode, delete lines 1650, 1660, 1750, and 1760.  Then insert the following:

       1612         PHP
       1614         SEP #$30
       ...
       1772         PLP

These changes add one cycle to the time.


       <<<8x8 listings here>>>


I will also show three sample 16-bit multiply subroutines....no, four.  The first one is a copy of Brooke Boering's code.  The second is a direct conversion of Brooke's code to 65802 code, with emphasis on space.  The third modifies the second with the tricks of Damon Slye; it takes more space, but it is faster.

The first three of these subroutines are modeled after the code in the original Apple monitor ROM.  The arguments are expected in page zero locations, low-byte first.  The result will also be in page zero locations.  The function performed is actually a little more than just multiplication, because it is possible to specify an addend as well.  The final result will be PRODUCT = ADDEND + (MULTIPLIER * MULTIPLICAND).  PRODUCT is stored in four consecutive bytes, backwards.  The highest byte is at PRODUCT+1, the next at PRODUCT, the next at PLIER+1, and the lowest at PLIER.  The fourth subroutine differs in that the product does not overlap the multiplier.
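
The multiply-with-addend behavior and the byte layout can be sketched in Python as follows.  This models the general shift-right scheme under the assumption that the high half of the result starts out holding the addend; it is not a transcription of any of the listings, and the function names are hypothetical.

```python
def multiply_add_16(multiplier, multiplicand, addend=0):
    """Return (PRODUCT, PLIER) words holding ADDEND + MULTIPLIER * MULTIPLICAND."""
    product = addend          # high 16 bits, preloaded with the addend
    plier = multiplier        # low 16 bits, initially the multiplier
    for _ in range(16):
        carry = 0
        if plier & 1:         # low multiplier bit decides whether to add
            product += multiplicand
            carry = product >> 16
            product &= 0xFFFF
        # shift the 32-bit pair right one bit, carry entering at the top
        plier = (plier >> 1) | ((product & 1) << 15)
        product = (product >> 1) | (carry << 15)
    return product, plier

def byte_layout(product, plier):
    """Show where each byte of the 32-bit result lands, highest byte first."""
    return {
        "PRODUCT+1": product >> 8,
        "PRODUCT": product & 0xFF,
        "PLIER+1": plier >> 8,
        "PLIER": plier & 0xFF,
    }

hi, lo = multiply_add_16(0x1234, 0x5678, 0x9ABC)
print(byte_layout(hi, lo))
```

Even the largest case, $FFFF + ($FFFF * $FFFF), fits in 32 bits, so the addend costs nothing extra in the width of the result.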

Looking at Brooke's version (lines 1000-1270) you can see that the loop contains a 16-bit addition (lines 1130-1190).  There are also two 16-bit ROR shifts, at lines 1200-1230.  These are the likely candidates for shortening via 65802 code.  My first version for the 65802 made no other changes in the loop.  I merely prefixed Brooke's code with CLC-XCE-REP to get into the 16-bit Native mode, and suffixed it with SEC-XCE to get back to Emulation mode.  Then I noticed another shortcut, and the result is in lines 1300-1480.

By moving the LDA PRODUCT up before the BCC opcode in lines 1370-1380, I was able to change a ROR PRODUCT to a simple ROR on the A-register followed by a STA PRODUCT.  This saves a net six cycles when the multiplier bit is "1", and costs two cycles when the multiplier bit is "0".  The average savings for random multipliers is four cycles, inside a loop that runs 16 times.

The faster version, in lines 1500-1780, merely implements Damon Slye's trick of pre-decrementing the multiplicand so as to avoid an explicit CLC opcode inside the 16-time loop.  It costs 12 cycles for the extra setup, but it saves two cycles for each one-bit in the multiplier.

The fourth version, in the separate listing as lines 1000-1430, uses the trick of splitting the multiplier in half.  In effect, two parallel 8-bit by 16-bit multiplies are accomplished, with the result usually taking less time than any of the other algorithms.  By deleting line 1130 (which shaves off another four cycles) the feature of allowing an addend can be included.
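
One way to picture the split is the Python sketch below.  It is only an interpretation of "two parallel 8-bit by 16-bit multiplies", with hypothetical names, and makes no claim about how the listing arranges its registers or shifts.

```python
def mul16_split(multiplier, multiplicand):
    """Multiply by handling both halves of the multiplier in one 8-pass loop."""
    low_half = multiplier & 0xFF
    high_half = multiplier >> 8
    partial_low = 0     # low_half * multiplicand
    partial_high = 0    # high_half * multiplicand
    for bit in range(7, -1, -1):
        partial_low <<= 1
        partial_high <<= 1
        if (low_half >> bit) & 1:
            partial_low += multiplicand
        if (high_half >> bit) & 1:
            partial_high += multiplicand
    # the high half's partial product is worth 256 times as much
    return partial_low + (partial_high << 8)

print(hex(mul16_split(0xFFFF, 0xFFFF)))   # 0xfffe0001
```

The loop runs only eight times instead of sixteen, which is where the savings in loop overhead would come from.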

Here is a summary of the execution cycles for the four 16-bit multiply subroutines:

               Minimum  Maximum  Average
       Boering   541      845      693
       Smaller   519      599      559
       Faster    531      579      555
       Fourth    332      684      508 (usually fastest)

Note that the third subroutine also goes even faster when the multiplicand = zero, because the bulk of the code is skipped.
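
The table can be turned into a rough timing model to see where each subroutine wins.  The Python sketch below assumes the time rises by the same amount for every one-bit in the multiplier, so the per-bit cost is (maximum - minimum) / 16; that is an assumption drawn from the table, not something measured from the listings, and the names are hypothetical.

```python
# (minimum, maximum) cycles from the summary table above
TIMES = {
    "Boering": (541, 845),
    "Smaller": (519, 599),
    "Faster": (531, 579),
    "Fourth": (332, 684),
}

def cycles(name, one_bits):
    """Estimated cycles for a multiplier containing this many one-bits."""
    minimum, maximum = TIMES[name]
    return minimum + (maximum - minimum) / 16 * one_bits

def fastest(one_bits):
    """Name of the subroutine with the lowest estimate."""
    return min(TIMES, key=lambda name: cycles(name, one_bits))

print(cycles("Smaller", 6), cycles("Faster", 6))   # 549.0 549.0
print(fastest(8))                                  # Fourth
print(fastest(12))                                 # Faster
```

Under this model the pre-decrement trick breaks even at six one-bits (12 cycles of setup against 2 cycles saved per bit), and the fourth version stays ahead until the multiplier has eleven or more one-bits.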

These are pretty good subroutines, but I have no doubt that they can be improved upon.  Why not try your hand?  If you can significantly improve either space or time or features, send your code to AAL.  We'll publish the best ones, and help advance the state of the art.  And if you have some classy division subroutines, they are welcome too!


    <<<listings of 16x16 routines>>>>
